Closed Bug 1137050 Opened 9 years ago Closed 9 years ago

FF36 startup crash in @0x0 with EMET, via js::InterruptRunningJitCode

Categories

(Core :: JavaScript Engine, defect)

36 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox36 + fixed
firefox37 --- fixed
firefox38 --- fixed
firefox39 --- fixed
relnote-firefox --- 36+

People

(Reporter: away, Assigned: away)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-cfd70f22-66f5-4e26-8a66-53cb62150224.
=============================================================

Socorro couldn't figure out the stack, but my debugger says:

0653f264 0653f2a4 0x0
0653f2ac 5ed94c82 0x653f2a4
0653f590 5e8e6aa7 xul!js::InterruptRunningJitCode+0x75
0653f594 5eb60751 xul!JSRuntime::requestInterrupt+0x25
0653f5b0 5ec2879b xul!mozilla::dom::workers::WorkerPrivateParent<mozilla::dom::workers::WorkerPrivate>::DispatchControlRunnable+0x7b
0653f5c8 5e9d1889 xul!mozilla::dom::workers::WorkerControlRunnable::DispatchInternal+0x55
0653f604 5ebf5fd7 xul!mozilla::dom::workers::WorkerRunnable::Dispatch+0x130
0653f614 5ebf5e36 xul!`anonymous namespace'::TimerThreadEventTarget::Dispatch+0x23
0653f654 5ead96b7 xul!nsTimerImpl::PostTimerEvent+0x10c
0653f704 6e178afb xul!TimerThread::Run+0x35b
0653f70c 5ead83d9 nss3!PR_GetCurrentThread+0x1b
0653f824 5ec27003 xul!nsThread::ProcessNextEvent+0x631
0653f854 5ead735e xul!mozilla::ipc::MessagePumpForNonMainThreads::Run+0x7f
0653f88c 5ead7845 xul!MessageLoop::RunHandler+0x20
0653f8ac 5ea70a32 xul!MessageLoop::Run+0x19
0653f8c4 6e174578 xul!nsThread::ThreadFunc+0x87
0653f8e0 6e17434f nss3!_PR_NativeRunThread+0x8c
0653f8e8 6e30c01d nss3!pr_root+0xb
0653f920 6e30c001 msvcr120!_beginthreadex+0xb4
0653f92c 7648919f msvcr120!_endthreadex+0x102
0653f938 773fb5af kernel32!BaseThreadInitThunk+0xe
0653f980 773fb57a ntdll!__RtlUserThreadStart+0x2f
0653f990 00000000 ntdll!_RtlUserThreadStart+0x1b

These crashes are within a few seconds of startup and all of them seem to have EMET v5.0.0.0.

It looks like a new regression in 36. This signature also happens on 35.0.1 but those are likely a different issue (no EMET, different stack).
[Tracking Requested - why for this release]: 

The numbers are early but this is currently #8/1.1% on 36. With more data, this may fall off the charts entirely, or it may stick around. I don't know. 

Since it's a startup crash, I'd rather err on the side of caution and get this on your radar.
I want to say this was provoked by bug 1099339, but I'm not super sure. Luke, what do you think?
Flags: needinfo?(luke)
Sounds quite likely.  What is EMET?  I'm guessing a security feature that makes SuspendThread() crash?

Is this just a matter of requesting additional permissions in the flags to OpenThread():
  https://hg.mozilla.org/releases/mozilla-release/file/33bd422806be/js/src/vm/Runtime.cpp#l262
so we have permission to SuspendThread(), or does EMET just disable SuspendThread?  If the latter, then can we detect EMET so we can have EnsureSignalHandlersInstalled() return false:
  https://hg.mozilla.org/releases/mozilla-release/file/33bd422806be/js/src/asmjs/AsmJSSignalHandlers.cpp#l1003
?  If so, then we can mitigate this problem pretty easily (at the cost of slightly slower JS JIT code on EMET).
Flags: needinfo?(luke)
It's a Microsoft utility that does a bunch of security hardening. I don't know if this is a crash-in-defense or just a bug. I'm trying to spin up a machine to repro.

Also interesting is that we haven't seen this crash with EMET v5.1. It looks like 5.0 had some issues with IE: http://blogs.technet.com/b/srd/archive/2014/11/10/emet-5-1-is-available.aspx
Or perhaps, reading the comments in the crash report, this is the "StackPivot" prevention which I'm guessing doesn't like the SetThreadContext().  It sounds like StackPivot is on a list of mitigations that can be enabled/disabled by an admin and it defaults to 'on'.
Reading http://support.microsoft.com/kb/3015976, I see "Mozilla Firefox" is also on the list of browsers that were fixed by 5.1.
Hmm, it is possible this still happens with 5.1: The timestamp on the EMET.dll in the dumps are from November 2014, which was when 5.1 was released. Maybe they didn't rev their FileVersion.
I installed EMET 5.1 and got the same claims-to-be-5.0 DLL with the same timestamp as the crash report. It hasn't yet crashed for me though, not even with StackPivot.
Startup crash, tracking.
David, do you think this bug is actionable? We have a likely driver for 36.0.1.
Flags: needinfo?(dmajor)
I still haven't been able to reproduce this, but, I have a theory:

It does indeed look like we're failing in SetThreadContext. The EMET GUI says the StackPivot alarm trips "if the stack pointer is changed to point to attacker-controlled memory areas." Does JIT code do funny things with esp/ebp? What if we're doing GetThreadContext, and esp happens to be in a weird state, then we SetThreadContext right back, and EMET complains?
Flags: needinfo?(dmajor) → needinfo?(luke)
Nope; you can follow the CONTEXT pointer and see that only pc is ever mutated, and only when interrupting asm.js (which is not happening during startup).  In the case of interrupted Ion, the CONTEXT will be identical to what came out of GetThreadContext.  I expect it's simply the act of calling SetThreadContext that triggers this.  If we need a mitigation; is there any way to detect EMET?
Flags: needinfo?(luke)
Yeah, I know that our code doesn't change that part of the CONTEXT. I am thinking what if the value is already weird, and we're SetThreadContext that same unchanged value, but it's still enough to make EMET unhappy?

We could probably test GetModuleHandle("EMET.dll") -- but then again so could bad guys, so I dunno if they deliberately intercept that.
Ah, that's a good theory; when we SuspendThread, we could be *anywhere* the main thread can reach.  A good experiment would be to move the SetThreadContext to right after the "*ppc = module.interruptExit();" in RedirectJitCodeToInterruptCheck; this is guaranteed to only happen when we're in JIT code (asm.js JIT code actually, so even if the crash wasn't fixed, it'd be a lot more rare).  It'd be really nice if there were a way to repro the crash and fix (both startup and with asm.js content).
Are there any intensive asm.js demos that don't use WebGL? (not available on my VM)
Something like this?
Attachment #8570296 - Flags: review?(luke)
I found a machine that can do WebGL but still couldn't repro the crash. So the fix is speculative.

Is it safe enough to put on m-r? (Although, maybe we don't need it if the crash rate keeps dropping. It is down to #40/0.28%)
Comment on attachment 8570296 [details] [diff] [review]
Speculative fix/mitigation - Don't SetThreadContext if the context didn't change

Exactly, yes.

>+            if (RedirectJitCodeToInterruptCheck(rt, &context)) {
>+                SetThreadContext(thread, &context);
>+            }

Nit for landing: no curlies around single-line then-branch in SM.
Attachment #8570296 - Flags: review?(luke) → review+
Without the curlies
Attachment #8570296 - Attachment is obsolete: true
Attachment #8570311 - Flags: review+
Tree's closed, and I'll be out tomorrow.
Keywords: checkin-needed
(In reply to David Major [:dmajor] (UTC+13) from comment #17)
> Is it safe enough to put on m-r? (Although, maybe we don't need it if the
> crash rate keeps dropping. It is down to #40/0.28%)
I cannot evaluate if it is safe or not but I guess it is expected that the crash rate is dropping when dealing with startup crash, right?
Assignee: nobody → dmajor
(In reply to David Major [:dmajor] (UTC+13) from comment #15)
> Are there any intensive asm.js demos that don't use WebGL? (not available on
> my VM)

http://kripken.github.io/Massive (specifically the Throughput section).

(In reply to Sylvestre Ledru [:sylvestre] from comment #21)
> I cannot evaluate if it is safe or not but I guess it is expected that the
> crash rate is dropping when dealing with startup crash, right?

I think in general you are right, but, from my brief reading about EMET, it sounds like it is enterprise-deployed and allows per-app, per-mitigation control (specifically for cases where EMET is crashing valid programs), so another explanation is that the admins got reports of crashes and just disabled the StackPivot mitigation for Firefox.  A bunch of crash-report comments (I know, not representative) specifically named EMET+StackPivot so I assume EMET must report this information.
(In reply to David Major [:dmajor] (UTC+13) from comment #17)
> (Although, maybe we don't need it if the
> crash rate keeps dropping. It is down to #40/0.28%)

That's normal for startup crashes, as people who can't launch Firefox will go back to an old version or a different browser - and we turned off automated updates yesterday per our usual process (to look at data before opening the floodgates for everyone) so few new people are exposed to this.
So, what should we accept this patch in 36.0.1 or it is too risky? Thanks
Flags: needinfo?(luke)
Flags: needinfo?(dmajor)
I think the risk is very low, since one can locally audit the flow of 'context' and see that we're calling SetThreadContext() in the one case it's modified.
Flags: needinfo?(luke)
Ok, could you fill the uplift request to aurora, beta & release? Thanks
dmajor: do we have confirmation that this fixes the problem?
https://hg.mozilla.org/mozilla-central/rev/f55fa183f081
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
(In reply to Luke Wagner [:luke] from comment #28)
> dmajor: do we have confirmation that this fixes the problem?

I still haven't been able to reproduce the crash myself, and we don't have enough volume on Nightly to confirm this from crash-stats.

I think the only way we'll get confirmation is from the bug reporters. I've just sent an email to a few people who reported the crash, asking them to test the tinderbox build from comment 29.
Flags: needinfo?(dmajor)
Comment on attachment 8570311 [details] [diff] [review]
Speculative fix/mitigation - Don't SetThreadContext if the context didn't change

Approval Request Comment
[Feature/regressing bug #]: Probably bug 1091912
[User impact if declined]: Startup crashes with EMET
[Describe test coverage new/current, TreeHerder]: green on m-i
[Risks and why]: This should be low risk. Functionally (from the JS engine's perspective) there should be no change in behavior. The only difference is that EMET doesn't observe the API call as often.
[String/UUID change made/needed]: none
Attachment #8570311 - Flags: approval-mozilla-release?
Attachment #8570311 - Flags: approval-mozilla-beta?
Attachment #8570311 - Flags: approval-mozilla-aurora?
Comment on attachment 8570311 [details] [diff] [review]
Speculative fix/mitigation - Don't SetThreadContext if the context didn't change

Approving for uplift based on conversation with Sylvestre.
Attachment #8570311 - Flags: approval-mozilla-release?
Attachment #8570311 - Flags: approval-mozilla-release+
Attachment #8570311 - Flags: approval-mozilla-beta?
Attachment #8570311 - Flags: approval-mozilla-beta+
Attachment #8570311 - Flags: approval-mozilla-aurora?
Attachment #8570311 - Flags: approval-mozilla-aurora+
I heard back from several users who are crashing with 36. None of them could reproduce the issue on the nightly before this fix. It makes me suspect that the issue is sensitive to minute differences in the process environment that can change between builds and machines (and maybe that's why I couldn't repro on 36). I think it's still good to have the fix on release, to reduce the chance of this happening again in the next build (possibly to a different set of machines).
Added to the release notes with "36.0.1 - Fixed a startup crash with EMET (1137050)" as wording
I never had the crash with 36.0 and I use EMET, I checked and I am using v5.1.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: